
Introduction to Web Scraping


1. What is Web Scraping?


Web scraping is the automated collection of data from websites by a program, instead of copying it by hand line by line. A few lines of code can gather hundreds or thousands of records in minutes.

2. Common Types of Websites

Websites are often classified by several criteria:

  • By dynamism: Static vs Dynamic websites
  • By frontend/backend technologies: React, Vue, Django, Laravel, etc.
  • By code architecture: Monolith, Microservices, etc.
  • By rendering technologies: SSR, CSR, hybrid

In this introduction we focus only on the classification by dynamism.

3. Static & Dynamic Websites


Static Websites:

  • Use only HTML and CSS; content is “fixed” — doesn’t change per visitor.
  • Easy to scrape since content is already in the HTML.
  • Examples: Portfolio sites, product landing pages.

Dynamic Websites:

  • Use backend processing — often PHP, Node.js, Python, etc.
  • Content changes based on user interaction or is loaded by JavaScript.
  • Harder to scrape: content may appear only after JavaScript runs, so you often have to wait for the page to finish loading (or drive a headless browser).
  • Examples: Shopee, Facebook, real-time price tracking sites.

4. What is the DOM Structure?

The DOM (Document Object Model) is the tree structure of a web page: each HTML tag is a node, and a node can be the parent or child of other nodes.

Simple example:

<body>
  <h1>Title</h1>
  <p>Description here</p>
</body>

In this example:

  • <body> is the parent

  • <h1> and <p> are children
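
These parent/child relationships can be inspected programmatically. A minimal sketch using BeautifulSoup (assumes the beautifulsoup4 package is installed):

```python
from bs4 import BeautifulSoup

html = """
<body>
  <h1>Title</h1>
  <p>Description here</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
print(h1.parent.name)  # "body" -- <body> is the parent of <h1>

# Direct children of <body> (recursive=False skips grandchildren)
children = [tag.name for tag in soup.body.find_all(recursive=False)]
print(children)  # ['h1', 'p']
```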

Larger DOM example:

<html>
  <head>
    <title>Page A</title>
  </head>
  <body>
    <div class="header">
      <h1>Welcome</h1>
    </div>
    <div class="content">
      <ul>
        <li>Book 1</li>
        <li>Book 2</li>
      </ul>
    </div>
    <footer>Contact</footer>
  </body>
</html>

  • <html> is the root node containing the entire webpage.

  • <head> contains page information like the title, not directly visible.

  • <title> is the page title shown on the browser tab.

  • <body> holds the main content visible to users.

  • Inside <body>, there are smaller parts called child nodes:

    • <div class="header"> contains the main header <h1>Welcome</h1>.

    • <div class="content"> contains a list of books with <li> items.

    • <footer> is the footer section with the text “Contact”.

This structure is like a tree: each tag is a branch or a leaf, which makes it easy to locate and extract data when scraping a website.
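
The tree above can be queried with CSS selectors. A short sketch with BeautifulSoup (assuming beautifulsoup4 is installed) that pulls the book titles out of the .content block:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Page A</title></head>
  <body>
    <div class="header"><h1>Welcome</h1></div>
    <div class="content">
      <ul>
        <li>Book 1</li>
        <li>Book 2</li>
      </ul>
    </div>
    <footer>Contact</footer>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector: every <li> inside an element with class "content"
books = [li.text for li in soup.select(".content li")]
print(books)  # ['Book 1', 'Book 2']
```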

5. Goals and Applications of Web Scraping


🎯 Main goals:

  • Automate data collection (fast, save effort)
  • Analyze and compare prices (products, crypto, flight tickets, etc.)
  • Track content changes (news, prices, rankings, etc.)
  • Create datasets for research, machine learning, statistics
  • Integrate into internal systems like dashboards or apps

6. Popular Python Libraries for Web Scraping

  • requests – Send HTTP requests, fetch HTML content
  • BeautifulSoup (bs4) – Easy HTML parsing and extraction
  • lxml – Fast and powerful parser for HTML/XML
  • selenium – Automate interaction with dynamic (JS) sites
  • scrapy – Framework for large crawling projects
  • httpx – Similar to requests but supports async
  • pyppeteer, playwright – Headless browser control, good for JS-heavy sites

🛠 Choose libraries based on your goals. For static sites, requests + BeautifulSoup is usually enough.


7. Real example: Scraping books from books.toscrape.com

The site books.toscrape.com is a sample site for practicing web scraping.

  • It is a static website, ideal for beginners
  • Contains 1000 books spread across 50 pages
  • Simple URL structure:
https://books.toscrape.com/catalogue/page-{page_number}.html
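
Since the pattern only varies by page number, the full list of page URLs can be generated with a format string:

```python
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

# 50 catalogue pages, numbered 1..50
urls = [BASE_URL.format(page) for page in range(1, 51)]
print(urls[0])    # https://books.toscrape.com/catalogue/page-1.html
print(len(urls))  # 50
```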

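
The lines explained below can be sketched as a small script (assumes requests and beautifulsoup4 are installed; the title attribute and the .price_color selector are taken from the site's markup and are assumptions beyond what is shown here):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Download one catalogue page and return its HTML."""
    res = requests.get(url)  # send an HTTP request for the page
    res.raise_for_status()   # fail loudly on 4xx/5xx responses
    return res.text

def parse_books(html):
    """Extract (title, price) pairs from a catalogue page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Each book sits in a tag carrying these four grid classes
    books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")
    results = []
    for book in books:
        title = book.h3.a["title"]                    # full title lives in the link's title attribute
        price = book.select_one(".price_color").text  # e.g. "£51.77"
        results.append((title, price))
    return results
```

Putting it together: parse_books(fetch_page("https://books.toscrape.com/catalogue/page-1.html")) returns a list of (title, price) tuples for the books on the first page.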

  • res = requests.get(url): Sends an HTTP request to get the webpage content at the given URL.

  • soup = BeautifulSoup(res.text, 'html.parser'): Parses the HTML content of the page using BeautifulSoup for easier processing.

  • books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3"): Selects all HTML elements with the class "col-xs-6 col-sm-4 col-md-3 col-lg-3" — these are the tags containing information for each book on the page.

Each element in books is a “node” containing detailed information about a book, making it easy to extract details like title, image, rating, price, etc.

This post is licensed under CC BY 4.0 by the author.